
1 Introduction

1.1 Problem description

When dealing with modern large datasets, the dimensionality of the observations often becomes a problem for the interpretability of the dataset or for its visualisation, especially during the data exploration phase. Principal Component Analysis (PCA) is a statistical technique used to reduce the dimension of the dataset while keeping as much information (i.e. variance) as possible from the original dataset. A typical use of PCA is bringing a high-dimensional dataset down to \(\mathbb{R}^2\) or \(\mathbb{R}^3\) to visualise its datapoints and afterwards identify clusters. It does so by computing principal components from the dataset and then projecting the datapoints onto this new basis.

Mathematically speaking, we can define principal components as an orthonormal basis of unit vectors which sequentially capture the most variance in the data. Practically, the PCs can be found with the following optimization problems for a given centered Gaussian dataset \(\mathbf x_1,\ldots,\mathbf x_N \in \mathbb{R}^p\): \[ \begin{split} v_1 &= \underset{v, \| v \|=1}{\mathrm{arg\,max}} \mbox{ } v^\top \widehat{\boldsymbol{\Sigma}} v \\ v_2 &= \underset{v, \| v \|=1, v^\top v_1 = 0}{\mathrm{arg\,max}} v^\top \widehat{\boldsymbol{\Sigma}} v \\ &\;\;\vdots \end{split} \]

with \(\widehat{\Sigma} = N^{-1}\sum_{n=1}^N \mathbf x_n \mathbf x_n^\top \in \mathbb{R}^{p\times p}\) being the empirical covariance matrix of the dataset. More information about Principal component analysis can be found in Jolliffe, 2002 or directly on Wikipedia.
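As a quick illustration of these optimization problems, the PCs can be obtained from the eigendecomposition of \(\widehat{\boldsymbol{\Sigma}}\); below is a minimal sketch in Python/NumPy (the report's own computations use R, so the code and variable names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered dataset: N = 500 points in R^3, most variance along axis 0.
N, p = 500, 3
X = rng.normal(size=(N, p)) * np.array([3.0, 1.0, 0.3])

# Empirical covariance  Sigma_hat = (1/N) * sum_n x_n x_n^T  (data centered).
Sigma_hat = X.T @ X / N

# The principal components are the eigenvectors of Sigma_hat,
# ordered by decreasing eigenvalue (variance captured).
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# v1 maximizes v^T Sigma_hat v over unit vectors; here it aligns with axis 0.
v1 = eigvecs[:, 0]
print(np.abs(v1))      # close to (1, 0, 0)
```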


Figure 1.1: Example of 2 PCs on a given dataset in \(\mathbb{R}^2\)

This project focuses on finding the optimal number of principal components (\(r\)), which dictates a trade-off between the model’s complexity and its interpretability: the higher \(r\), the higher the explained variance, but the harder the reduced dataset is to interpret. To choose the number of PCs optimally, several techniques have been developed, such as the percentage of variance explained method, the scree-plot method or cross-validation for PCA.

In this report, one will first find a presentation of each cross-validation method for PCA, with an initial description followed by pseudo-code, in [2. Cross-validation on PCA methods]. Then, those methods will be run on simulated datasets with different parameters in [3. Simulation and comparison of CV methods]. Furthermore, the percentage of variance explained and scree-plot methods will be described and tested in [4. Other PCs number selection methods]. Finally, the conclusion will wrap up the take-home messages as well as next steps which could be conducted on this subject.

2 Cross-validation on PCA methods

2.1 Naïve Approach

A widely used approach for cross-validation in PCA is a method that has the right intentions but fails at a critical point: it does not take into account any bias-variance trade-off, meaning that the optimal solution will always be \(r = R\). We start with the usual conventions by denoting \(X \in \mathbb{R}^{n \times p}\) our data with \(n\) observations and \(p\) variables; then, as presented in Notes 06 - T. Masak, below is the pseudo-code of the Naïve approach:

  • split data \(\mathbf{X}\) into \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • for \(k=1,\ldots,K\):
    • solve \(\widehat{\mathbf{L}} = \underset{\mathrm{rank}(\mathbf L) = r}{\mathrm{arg\,min}} \|\mathbf X[J_k^c,:] -\mathbf L \|_2^2\)
    • calculate \(Err_k(r) = \frac{1}{|J_k|}\sum_{m \in J_k} \| x^m - P_{\widehat{\mathbf{L}}} x^m \|_2^2\)
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\)

A numerical way to solve the least squares problem at the top of the algorithm is to compute the SVD of \(\mathbf X[J_k^c,:]\) and truncate it to rank \(r\) by setting the \(p-r\) smallest singular values to zero. To estimate \(\mathbf{x^{m}}, \forall m \in J_k\), with a lower-dimensional version, we project it with the matrix \(P_{\mathbf{\widehat{L}}}\), which can be computed via the R built-in function prcomp(), which returns among other objects a basis for the column space of \(\widehat{\mathbf L}\). The promised issue occurs when computing the error measure without further adjustment: the error shrinks as \(\widehat{\mathbf{L}}\) is allowed to have higher rank, and it converges to \(0\) when that rank reaches the rank of the initial dataset \(\mathbf{X}\), since the truncated SVD then reproduces \(\mathbf{X}\) almost exactly.
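To make the flaw concrete, here is a minimal sketch of this naïve scheme (in Python/NumPy for illustration, while the report's implementation is in R): on noiseless low-rank data, the held-out error collapses towards zero as soon as \(r\) reaches the true rank, so the largest rank always wins.

```python
import numpy as np

def naive_cv_errors(X, K=5, seed=0):
    """K-fold CV error of the naive scheme for every rank r = 1..p.
    Held-out rows are projected onto the principal subspace fitted on
    the remaining rows, so the error can only shrink as r grows."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    err = np.zeros(p)
    for J in folds:
        train = np.ones(n, dtype=bool)
        train[J] = False
        # SVD of the training rows; the top-r right singular vectors span L_hat
        _, _, Vt = np.linalg.svd(X[train], full_matrices=False)
        for r in range(1, p + 1):
            V = Vt[:r].T
            resid = X[J] - X[J] @ V @ V.T        # x - P_{L_hat} x
            err[r - 1] += np.mean(np.sum(resid**2, axis=1))
    return err

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 6))  # noiseless rank-4 data
err = naive_cv_errors(X)
print(np.round(err, 6))   # decreasing; essentially 0 once r reaches rank(X) = 4
```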

2.2 Artificially turn the unsupervised problem into a supervised one

This method corrects the [2.1 Naïve Approach] by introducing a bias-variance trade-off in the \(Err_k\) function. Let again \(\mathbf{X} \in \mathbb{R}^{n \times p}\) be our multivariate Gaussian dataset with \(n\) observations and \(p\) variables, mean \(\mu\) and covariance \(\Sigma\). As described in Notes 06 - T. Masak, we implement this method as below:

  • split data into \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • for \(k=1,\ldots,K\):
    • Compute \(\hat{\mu}\) and \(\hat{\Sigma} := \hat{\Sigma}_r\) (the empirical estimator truncated to rank r) on \(\mathbf{X}\setminus J_k\).
    • Split data points \(\mathbf{x}^m\in \mathbb{R}^{p}, m \in J_k\) into a “missing” part \(\mathbf{x}_{miss}\) that will be used for validation and an “observed” part \(\mathbf{x}_{obs}\) in order to estimate its counterpart
    • Predict the missing part from the observed part using the \(\hat{\mu}\) and \(\hat{\Sigma}_r\)
    • That is compute \(\hat{\mathbf{X}}^m_{miss} = \mathbb{E}_{\hat{\mu},\hat{\Sigma}_r}[\mathbf{X}^m_{miss}|\mathbf{X}^m_{obs}=\mathbf{x}^m_{obs}] = \hat{\mu}_{miss} + \hat{\Sigma}_{miss,obs}\hat{\Sigma}^\dagger_{obs}(\mathbf{X}^m_{obs}-\hat{\mu}_{obs})\), where \(\hat{\Sigma}^\dagger_{obs}\) is the pseudoinverse of \(\hat{\Sigma}_{obs,obs}\) and \(\hat{\mu}_{miss},\hat{\mu}_{obs}\) are \(\hat{\mu}\) split according to the “missed” and “observed” variable indices.
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\), where for instance \(Err_k(r)\) can be considered as the MSE for every fold \(J_k\), e.g. \(Err_k(r)=\frac{1}{|J_k|}\sum_{m\in J_k} \|\hat{\mathbf{X}}^m_{miss}-\mathbf{X}^m_{miss}\|^2_2\)

It is important to emphasize some points of the aforementioned algorithm. When we truncate \(\hat{\Sigma}\) to \(\hat{\Sigma}_r\) in every fold, we do it by eigenvalue decomposition \(\hat{\Sigma} = U\Lambda U^\top\): we store \(\alpha = \sum_{i=r+1}^p \lambda_i\) and then set those \(p-r\) smallest eigenvalues to zero. Afterwards, we compute \(\hat{\Sigma}_r = U\tilde{\Lambda}U^\top\), where \(\tilde{\Lambda}\) is the truncated eigenvalue matrix, and add \(\alpha / p\) to its diagonal. This is done to smooth the drop of eigenvalues when truncating the covariance estimator and to avoid undesirable results in our algorithms.
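The truncation-plus-smoothing step can be sketched as follows (Python/NumPy for illustration; truncate_cov is our name). Note that adding \(\alpha/p\) to the diagonal preserves the total variance \(\mathrm{tr}(\hat{\Sigma})\):

```python
import numpy as np

def truncate_cov(Sigma, r):
    """Rank-r truncation with diagonal smoothing: drop the p - r smallest
    eigenvalues but spread their total mass alpha back on the diagonal."""
    p = Sigma.shape[0]
    lam, U = np.linalg.eigh(Sigma)          # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]          # reorder to descending
    alpha = lam[r:].sum()                   # mass of the p - r smallest
    lam_trunc = np.concatenate([lam[:r], np.zeros(p - r)])
    return (U * lam_trunc) @ U.T + (alpha / p) * np.eye(p)

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T
Sigma_r = truncate_cov(Sigma, r=2)
# The alpha/p correction preserves the total variance (the trace):
print(np.isclose(np.trace(Sigma_r), np.trace(Sigma)))   # True
```

The smoothed \(\hat{\Sigma}_r\) is also strictly positive definite, which is exactly what avoids the singularity issues mentioned below.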

Hence, we reduce the rank of the covariance of \(\mathbf{X}\) but preserve its dimension, which allows us to estimate \(\mathbf{x}^m_{miss}\) in the later part of the algorithm. When splitting the observations in every iteration of the for loop, the question naturally arises whether we should always take the same split or a different (random) one. The first possibility was chosen, as the additional random factor would be negligible by the SLLN. Nevertheless, it makes sense to keep two equally sized parts in order to have a proper estimation basis without overestimating. This also gives us a good understanding of whether our estimation method will work in harsh conditions, i.e. if half of the data is missing. When estimating \(\hat{\mathbf{x}}^m_{miss}\), the linear estimator is used as it emerges from the law of our dataset \(\mathbf{X}\); hence, it is only applicable in this form to Gaussian distributed observations \(\mathbf{x} \in \mathbb{R}^p\). The estimation also makes use of the pseudoinverse of \(\hat{\Sigma}_{obs,obs}\), denoted by \(\hat{\Sigma}^\dagger_{obs}\). We use this convention in order to avoid singularity issues when inverting a block of the truncated covariance estimator. The pseudoinverse is easily callable in R via the function ginv() from the MASS package.
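The conditional-mean prediction step, with the pseudoinverse standing in for MASS::ginv(), can be sketched like this (Python/NumPy for illustration; predict_missing is a hypothetical helper name):

```python
import numpy as np

def predict_missing(x_obs, obs, miss, mu, Sigma):
    """Best linear predictor of the 'missing' coordinates given the observed
    ones under a Gaussian model:  mu_m + S_mo S_oo^+ (x_o - mu_o)."""
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    return mu[miss] + S_mo @ np.linalg.pinv(S_oo) @ (x_obs - mu[obs])

# Sanity check on a 2-d Gaussian with correlation 0.9: observing x0 far above
# its mean should pull the prediction of x1 above its mean too.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
x1_hat = predict_missing(np.array([2.0]), obs=[0], miss=[1], mu=mu, Sigma=Sigma)
print(x1_hat)    # [1.8]  since E[x1 | x0 = 2] = 0.9 * 2
```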

2.3 Missing data approach

Here it is assumed that the rows \(\mathbf{X}_{n} \in \mathbb{R}^p, n=1,\ldots,N\) of the data matrix \(\mathbf{X}=(\mathbf{X}_{n,j})^{N,p}_{n,j=1}\) are i.i.d. realizations of a multivariate Gaussian random variable with mean \(\mu \in \mathbb{R}^p\) and covariance \(\Sigma \in \mathbb{R}^{p \times p}\). We will assume that \(\mathrm{rank}(\Sigma)=r\) for \(r=1,\ldots,R\) and compare how well different values of \(r\) fit the data. As described in Notes 06 - T. Masak, this approach is based on computing the estimators \(\hat{\Sigma}\) and \(\hat{\mu}\) from the observed data by the EM algorithm, while the sets of missing and observed entries are randomly selected at the beginning of the algorithm. It then performs more or less the same steps as the previous algorithm to find the optimal number of PCs. Let us first describe the EM algorithm used in our case and then the Missing data approach itself.

Let \(\tilde{X}\) be the dataset containing missing values randomly placed on the original \(N \times p\) dataset. The EM algorithm works as follows:

  • Initialize \(\hat{\mu} = \frac{1}{N}\sum_{n=1}^N(\tilde{X}^{obs}_{n})\) and \(\hat{\Sigma} = I_{p\times p}\), where \(\tilde{X}_{obs}\) is the dataset restricted to observed values.
  • Set \(\hat{\Sigma}^{old} = C\hat{\Sigma}\), and \(tol = 0.01\) for a given constant \(C\) big enough to enter the while loop (e.g. \(C = 600\))
  • while \((||\hat{\Sigma}-\hat{\Sigma}^{old}||_F > tol||\hat{\Sigma}^{old}||_F)\) where \(||.||_F\) is the Frobenius norm
    • Set \(\hat{\Sigma}^{old} = \hat{\Sigma}\)
    • Estimate \(\tilde{X}_{miss}^{n}\) by a linear estimator : \(\tilde{X}_{miss}^n = \hat{\mu}_{miss} + \hat{\Sigma}_{miss,obs} \hat{\Sigma}_{obs}^{\dagger}(\tilde{X}_{obs}^n-\hat{\mu}_{obs})\)
    • Set \(\hat{\mu} =\frac{1}{N}\sum_{n=1}^N\tilde{X}_{n}\) and \(\hat{\Sigma} = \frac{1}{N-1}\sum_{n=1}^N (\tilde{X}_n-\hat{\mu})(\tilde{X}_n-\hat{\mu})^\top\), i.e. the two empirical estimators on the completed data
  • end while
  • return the estimated \(\hat{\mu}, \hat{\Sigma}\)
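The EM loop above can be sketched as follows (Python/NumPy for illustration; em_gaussian is our name, and the while loop is written as a capped for loop):

```python
import numpy as np

def em_gaussian(X_tilde, obs_mask, tol=0.01, max_iter=200):
    """EM-style imputation for Gaussian data with missing entries: impute by
    the linear predictor, refit mu and Sigma on the completed matrix, stop
    once Sigma stabilises (relative Frobenius change below tol)."""
    N, p = X_tilde.shape
    X = X_tilde.astype(float).copy()
    col_means = np.array([X[obs_mask[:, j], j].mean() for j in range(p)])
    for n in range(N):                              # initial fill of the holes
        X[n, ~obs_mask[n]] = col_means[~obs_mask[n]]
    mu, Sigma = X.mean(axis=0), np.eye(p)
    for _ in range(max_iter):
        Sigma_old = Sigma
        for n in range(N):
            o, m = obs_mask[n], ~obs_mask[n]
            if m.any():                             # linear (conditional-mean) estimator
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                X[n, m] = mu[m] + S_mo @ np.linalg.pinv(S_oo) @ (X[n, o] - mu[o])
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)             # 1/(N-1) empirical estimator
        if np.linalg.norm(Sigma - Sigma_old) <= tol * np.linalg.norm(Sigma_old):
            break
    return mu, Sigma, X

rng = np.random.default_rng(3)
Z = rng.normal(size=(300, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 1.0]])
mask = rng.random(Z.shape) > 0.1                    # True = observed (about 90 %)
mu_hat, Sigma_hat, Z_filled = em_gaussian(Z, mask)
print(np.round(mu_hat, 2))                          # close to the true mean (0, 0, 0)
```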

This method slightly differs from the one presented in Notes 05 - T. Masak, mainly due to poor results when estimating \(\hat{\Sigma}\) and some confusion about the notations used in the aforementioned implementation. Another, matrix-based approach for the EM algorithm was also tried, from Machine Learning, Murphy, which gave good results for \(\hat{\mu}\) but was also unsuccessful in predicting \(\hat{\Sigma}\) when \(r\) becomes too large.

Below is the pseudo-code for the Missing Data method:

  • split the dataset in \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • Compute the bivariate index set \(\Omega\) of entries to be treated as missing
  • for \(k=1,\ldots,K\):
    • Compute \(\hat{\mu}\) and \(\hat{\Sigma}\) by the EM algorithm with the dataset \(X\) and missing indices \(\Omega\)
    • Truncate \(\hat{\Sigma}\) to a rank \(r\) matrix \(\hat{\Sigma}_r\) by eigenvalue decomposition.
    • Reuse the missed data \(\Omega\) to split the data points \(\mathbf{x}^m\in \mathbb{R}^{p}, m \in J_k\) into a “missing” part \(\mathbf{x}_{miss}\) that will be used for validation and an “observed” part \(\mathbf{x}_{obs}\) in order to estimate its counterpart
    • Estimate the missing part from the observed part with a linear estimator using the \(\hat{\mu}\) and \(\hat{\Sigma}_r\) as previously described in the last CV method.
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\), where for instance \(Err_k(r)\) can be considered as the MSE for every fold \(J_k\), e.g. \(Err_k(r)=\frac{1}{|J_k|}\sum_{m\in J_k} \|\hat{\mathbf{X}}^m_{miss}-\mathbf{X}^m_{miss}\|^2_2\)

As one can observe, this method is quite similar to the previous one, except for the way missing values are produced and the way the parameters are estimated. Here, the missing values are randomly selected across the dataset and can be seen as holes in it, while for the previous CV method whole columns (i.e. variables) of the validation points were taken as missing. Also, as we are now dealing with holes in the dataset, a fully defined estimator of \(\Sigma\) was not guaranteed, compared to the other method for which we could easily select the observed and missing blocks of the covariance. This forced us to use the EM algorithm for this CV method, as opposed to the other one where simple mean() and cov() function calls did the job.

2.4 KDE modified approach

This approach is a modification of the KDE method from Notes 06 - T. Masak. Let, as usual, \(\mathbf{X}\) be a multivariate Gaussian dataset of size \(N \times p\). The goal is to minimize w.r.t. \(R\) the following quantity:

\[ \mathbb{E} \| \widehat{C}_R - C \|_F^2 = \mathbb{E}\| \widehat{C}_R \|_F^2 - 2 \mathbb{E} \langle \widehat{C}_R, C \rangle + \| C \|_F^2 \]

with \(\widehat{C}_R\) being the truncated covariance estimator of rank \(R\), \(C\) being the covariance matrix and \(\|\cdot\|_F\) being the Frobenius norm, while \(\langle X, Y \rangle\) represents the sum of all elements of the Hadamard product of the two matrices \(X,Y\), or in mathematical terms: \(\langle X, Y \rangle := \sum_{i,j}X_{i,j}Y_{i,j}\). Assuming that the mean of \(\mathbf{X}\) is zero, we get: \[ \mathbb{E} \langle \widehat{C}_R^{(-n)}, X_n X_n^\top \rangle = \langle \underbrace{\mathbb{E}\,\widehat{C}_R^{(-n)}}_{\approx\, \mathbb{E} \widehat{C}_R}, \underbrace{\mathbb{E} [X_n X_n^\top]}_{ =\, C} \rangle \approx \mathbb{E} \langle \widehat{C}_R , X_n X_n^\top \rangle \]

Since the truncated covariance estimator is not linear, this is only an approximation, which we plug back into the initial quantity we wish to minimize. The constant \(\| C \|_F^2\) does not depend on \(R\), so it can be dropped, and minimizing the two remaining terms over \(R\) yields the desired rank. Overall, we implement this method as follows for every rank \(R\):

  • Estimate Covariance \(\widehat{C}\) from given data \(X\)
  • Decompose \(\widehat{C}\) and set \(p-R\) smallest eigenvalues to \(0\), then rebuild it to reduce its rank to \(R\) (\(\widehat{C}_R\))
    • Set \(Err_R = \|\widehat{C}_R\|_F^2\)
  • for \(j\) in \(1,\dots,N\):
    • Compute \(\widehat{C}_R^{(-j)}\) by estimating the covariance of \(X^{(-j)}\), \(j\) being the missing observation in \(X\) then truncate it to rank \(R\).
    • Subtract \(\frac{2}{N}\langle\widehat{C}_R,\widehat{C}_R^{(-j)}\rangle\) from \(Err_R\)
  • end
  • Choose \(R^* = \underset{R}{\mathrm{arg\,min}} \quad Err_R\) according to the minimal error \(Err_R\)

Stepping into the for-loop, we realize that it is quite similar to leave-one-out cross-validation: in this method we are trading high computational cost for a stable solution. As shown in the simulations, we will often get negative values for \(Err_R\), since one component of the target quantity (\(\|C\|_F^2\)) cannot be computed; as it does not depend on \(R\), it is not needed to find the optimal \(R\).
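The whole procedure can be sketched as follows (Python/NumPy for illustration; as discussed, the returned \(Err_R\) values are typically negative since \(\|C\|_F^2\) is dropped):

```python
import numpy as np

def trunc(C, R):
    """Keep only the R largest eigenvalues of the symmetric matrix C."""
    lam, U = np.linalg.eigh(C)          # eigenvalues in ascending order
    lam[:len(lam) - R] = 0.0            # zero out the p - R smallest
    return (U * lam) @ U.T

def kde_modified_errors(X):
    """Computable part of E||C_R - C||_F^2 for every rank R, assuming centered
    data: Err_R = ||C_R||_F^2 - (2/N) * sum_j <C_R, C_R^{(-j)}>."""
    N, p = X.shape
    C = X.T @ X / N
    errs = np.zeros(p)
    for R in range(1, p + 1):
        C_R = trunc(C, R)
        err = np.sum(C_R**2)                    # ||C_R||_F^2
        for j in range(N):                      # leave-one-out loop
            X_j = np.delete(X, j, axis=0)
            C_Rj = trunc(X_j.T @ X_j / (N - 1), R)
            err -= 2.0 / N * np.sum(C_R * C_Rj)
        errs[R - 1] = err
    return errs

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 5)) * 3   # exact rank 2
errs = kde_modified_errors(X)
print(np.round(errs, 2))        # negative values; the drop flattens at R = 2
```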

2.5 Matrix completion method

This final method is somewhat more general, as we do not need to assume any probability distribution for the data \(\mathbf{X} \in \mathbb{R}^{N\times p}\), but just that \(\mathbf X\) has finite rank. Indeed, for a given data matrix \(\mathbf{X}\), we assume that a portion of this matrix is missing and that the remaining elements are observed. Let \(\Omega\) be a bivariate index set for the observed data in \(\mathbf{X}\). The matrix completion method consists of finding a matrix \(\mathbf M\) solving:

\[ \mathrm{arg \, min}_{\mathbf{M}=(M_{ij})_{i,j=1}^{N \times p}} \sum_{(i,j) \in \Omega} (X_{ij} - M_{ij})^2 \quad \mathrm{s.t.} \quad \mathrm{rank}(\mathbf M)=R \]

This optimization gives us the best rank-\(R\) matrix \(\mathbf M\) w.r.t. the least-squares error on the observed data. However, finding the solution of such an optimization problem is \(NP\)-hard, so one can use the following iterative algorithm to find local best candidates for the matrix \(\mathbf M\):

  • Select an initial candidate \(\mathbf{M}^{(0)}\); here, \(M^{(0)}\) is a set of multivariate Gaussian random variables with mean \(\hat{\mu}_{obs}\) and covariance \(I_{p\times p}\), where \(\hat{\mu}_{obs}\) is the empirical mean computed for each column from its respective observed entries of the matrix.
  • Set \(l=0\), \(tol = 10^{-4}\)
  • do
    • \(l += 1\)
    • Compute the SVD of \(\mathbf{M}^{(l-1)}\), i.e. \(\mathbf{M}^{(l-1)} = UDV^T\)
    • Copy \(D\) into \(\widetilde{D}\) and set the \(p-R\) diagonal elements of \(\widetilde{D}\) to zero (truncating to rank \(R\))
    • Set \(\widetilde{M} = U\widetilde{D}V^T\), the new matrix of rank \(R\).
    • Set \(\mathbf{M}^{(l)}\)as : \[ \mathbf{M}^{(l)} = \begin{cases} X_{ij} \quad \text{for } (i,j) \in \Omega \\ \widetilde{M}_{i,j} \quad \text{for } (i,j) \notin \Omega \end{cases} \]
  • while \(\|M^{(l)}-M^{(l-1)}\|_F> tol\|M^{(l)}\|_F\) and \(l < 1000\)
  • Store the error \(\sum_{(i,j) \in \Omega} (X_{ij} - \widetilde{M}_{ij})^2\) (using the truncated matrix \(\widetilde{M}\) from the last iteration, since \(M^{(l)}\) coincides with \(X\) on \(\Omega\) by construction), then repeat this experiment for each value of \(R\) from \(1\) to \(p\) and pick the \(R\) which gives the lowest error.

We decided to multiply the initial stopping criterion \(\|M^{(l)}-M^{(l-1)}\|_F\) by \(\|M^{(l)}\|_F\) to take into account the magnitude of the elements of \(M\), and added the condition \(l<1000\) in case the algorithm would not converge to the desired tolerance in a reasonable amount of time. In other words, the algorithm stops and returns the matrix \(M^{(l)}\) if there was a change of less than \(0.01\%\) from the previous matrix \(M^{(l-1)}\), or if we go above \(1000\) iterations.
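A sketch of this completion loop (Python/NumPy for illustration; the stored error is computed on the observed entries of the truncated matrix \(\widetilde{M}\), since the completed matrix matches \(X\) on \(\Omega\) by construction):

```python
import numpy as np

def complete_matrix(X, observed, R, tol=1e-4, max_iter=1000, seed=0):
    """Iterative low-rank completion: truncate to rank R by SVD, restore the
    observed entries, stop once the relative change is below tol."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu_obs = np.array([X[observed[:, j], j].mean() for j in range(p)])
    M = mu_obs + rng.normal(size=(N, p))         # random init around mu_obs
    M[observed] = X[observed]
    M_tilde = M
    for _ in range(max_iter):
        M_old = M
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s[R:] = 0.0                              # truncate to rank R
        M_tilde = U @ np.diag(s) @ Vt
        M = np.where(observed, X, M_tilde)       # keep observed entries fixed
        if np.linalg.norm(M - M_old) <= tol * np.linalg.norm(M):
            break
    return M, M_tilde

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))   # true rank 3
observed = rng.random(X.shape) > 0.2                      # about 80 % observed
errs = []
for R in range(1, 9):
    _, M_tilde = complete_matrix(X, observed, R)
    errs.append(np.sum((X[observed] - M_tilde[observed])**2))
print(np.round(errs, 4))    # drops sharply around the true rank 3
```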

3 Simulation and comparison of CV methods

3.1 Simulation dataset description

As described in Cross-validation on PCA methods, one is now equipped with multiple approaches to recover the number of dimensions needed to represent a dataset “accurately”. Intuitively, every method has its own advantages and drawbacks. To get a better feeling of how the cross-validation methods behave in different circumstances, one can run a simulation study on a variety of synthesized datasets. This allows us to discover the limits of the different methods and to recommend different algorithms with respect to the underlying data. By construction, every one of our algorithms besides the Naïve Approach and the Matrix completion method is only applicable to multivariate Gaussian distributed data. Therefore we restrict ourselves to this special distribution and create corresponding data in order to get meaningful and comparable results. All in all, \(9\) different datasets were considered and the results of the five algorithms were compared. Furthermore, we only consider centered Gaussian multivariate random variables and create three base datasets, on which we superimpose three kinds of noise. The first base dataset is a standard multivariate Gaussian dataset, while the second and third rely on a random and a structured covariance matrix respectively. We denote our datasets as \(D^{i}_0, D^{i}_1\) and \(D^{i}_2\) for \(i \in \{1,\dots,3\}\), corresponding to each type of noise; each consists of data \(\mathbf{X}=\{\mathbf{X}^1,\dots,\mathbf{X}^N\}\) and noise \(\epsilon^{i}\). The number of different parameters makes it difficult to give a single big picture of the problem, which is why we implemented a Shiny App that lets the reader tune the parameters as they please. We strongly recommend checking it out and testing the limits of the implemented cross-validation methods.

\[ D^{i}_0 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \mathbb{1}_{p\times p}), \forall j = 1,\dots,N\\ D^{i}_1 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \Sigma_1), \forall j = 1,\dots,N \text{ and } \Sigma_1=M\cdot M^{\top}, (M_{ij})_{1\leq i,j\leq p} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,1)\\ D^{i}_2 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \Sigma_2), \forall j = 1,\dots,N \text{ and } \Sigma_2=M^{\top}\cdot M, (M_{ij})_{1\leq i,j\leq p}=\frac{(i-1)p+j}{p^2} \]

For the covariance \(\Sigma_2\) of \(D^{i}_2\) we get a structured matrix that emphasizes the meaningful position of the last variables. For clarity, we give an example for \(p=3\):

\[ M = \frac{1}{9}\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \implies \Sigma_2 = \frac{1}{81}\begin{pmatrix} 66 & 78 & 90 \\ 78 & 93 & 108 \\ 90 & 108 & 126 \\ \end{pmatrix} \]

In order to have valid comparisons across all of our methods and to validate their correctness, we fix the rank of \(\mathbf{X}\) as follows. First, by the random construction of \(\mathbf{X}\), we can assume that it has full rank \(p\). Second, we truncate it to a specific rank \(r\), which should then be the number of PCs recommended by our algorithms. Since \(\mathbf{X}\) is not symmetric, we do so by performing a singular value decomposition of \(\mathbf{X}\) with the built-in svd() function and setting its \(p-r\) smallest singular values to \(0\), which enables us to control its rank. Please note that \(D_2\) is already of rank \(2\) before adding noise. Having adapted \(\mathbf{X}\), we add either high uniform noise, low uniform noise, or noise whose level differs across variables.

\[ \epsilon^{1}_p \stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2\mathbb{1}_{p\times p}), \sigma=0.02*Val_{SVD}^r \text{ where } Val_{SVD}^r \text{ is the r-th singular value of } \mathbf{X}\\ \epsilon^{2}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2\mathbb{1}_{p\times p}), \sigma=0.001*Val_{SVD}^r \\ \epsilon^{3}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0, U), U_{ii}\stackrel{\text{iid}}{\sim} \mathcal{U}([0,0.005*Val_{SVD}^r]) \text{ and } U_{ij}=0, \text{for } i\neq j \\ \]

We run our simulation with \(n=100\) observations per dataset and \(p=8\) variables. We truncate our dataset to rank \(r=3\) and repeat our cross-validation \(sim=5\) times with seed \(1312\). The methods using classical cross-validation loop over \(K=5\) folds. In order to take into account the magnitude of our dataset, we chose to multiply the noise’s variance by the \(r\)-th singular value of \(\mathbf{X}\).
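The simulation setup can be reproduced along the following lines (a Python/NumPy sketch of the R simulation; make_dataset is our name, and the constants \(0.02\), \(0.001\) and \(0.005\) are those given above):

```python
import numpy as np

def make_dataset(n=100, p=8, r=3, base=0, noise=1, seed=1312):
    """One simulated dataset: draw Gaussian X under one of the three
    covariance choices, truncate X to rank r via SVD, then add noise
    scaled by the r-th singular value of X."""
    rng = np.random.default_rng(seed)
    if base == 0:
        Sigma = np.eye(p)
    elif base == 1:
        M = rng.normal(size=(p, p))
        Sigma = M @ M.T                          # random covariance
    else:
        M = np.arange(1, p * p + 1).reshape(p, p) / p**2
        Sigma = M.T @ M                          # structured covariance, rank 2
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_r = s[r - 1]                               # r-th singular value of X
    s[r:] = 0.0                                  # truncate X to rank r
    X = U @ np.diag(s) @ Vt
    if noise == 1:                               # high uniform noise
        eps = rng.normal(0.0, 0.02 * s_r, size=(n, p))
    elif noise == 2:                             # low uniform noise
        eps = rng.normal(0.0, 0.001 * s_r, size=(n, p))
    else:                                        # variance differing per variable
        var = rng.uniform(0.0, 0.005 * s_r, size=p)
        eps = rng.normal(size=(n, p)) * np.sqrt(var)
    return X + eps

D = make_dataset()
print(D.shape)          # (100, 8)
```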

3.2 Comparison with respect to noise

For this analysis, we focus on \(D_0 \in \mathbb{R}^{N \times p}\) and vary the noise \(\epsilon^i_p\) for \(i \in \{1,2,3\}\). Hereafter are the errors for each CV method described in Section 2.


Figure 3.1: Comparison of CV methods on base 0 dataset with different noises as indicated by Noise°1,2 and 3

Fixing our dataset to be \(D_0\) and applying our CV methods, we get very different results for each algorithm. Please note that the \(y\)-axis is in log scale in order to get meaningful insights about the errors, especially for the Missing Data approach. First, the error of the Naïve Approach decreases continuously and achieves its best estimation when allowing the method to make use of all \(8\) variables, as theoretically expected.

The Wrong PCA Improved method, however, counters this behavior, and the error rises after hitting the optimal value \(r=3\). Overall this leads the Wrong PCA Improved method to favor rank \(3\) as an estimate for the rank of our modified dataset. In general we can still see that the Wrong PCA Improved method has difficulties extracting the exact number of variables, especially when the noise added to the dataset is constantly high, as the high-noise curve in yellow is less steep than the blue low-noise one. We also note that for the high-noise curve there is a tendency to revert back down when \(r\) increases, which might cause problems as the biggest value of \(r\) might then be returned.

The Missing data approach differs in its overall behavior from the previous two methods. While the error only decreases slightly over the first and second variables, it drops noticeably at \(r=3\) in our simulation study, especially for low and differing noise. Beyond rank \(3\), there is a strong increase in the error due to the \(\Sigma\) estimation of the EM algorithm diverging from the expected value. A possible reason for this divergence is the choice of the initial covariance matrix (in our case the identity matrix). We conclude that this approach also finds the correct dimension, but one has to be careful about the amount of noise added, as this challenges the algorithm quite a bit.

The error measure of the Matrix completion method looks like a mix of the errors of the Wrong PCA Improved and Missing Data approaches. We observe a slight increase over the first two variables, followed by a strong error drop at variable \(3\) before the error starts to increase again, clearly more strongly than at the beginning.

As for the KDE Modified Approach, it will be analysed at the end of this section due to its similar behavior across the different datasets.

After this initial analysis, one would like to examine how the CV methods generalize to other datasets, i.e. moving from the identity covariance matrix to a random one. Hereafter are the results for \(D_1\), the dataset with a random covariance matrix, as defined in the Simulation dataset description.


Figure 3.2: Comparison of CV methods on base 1 dataset with different noises as indicated by Noise°1,2 and 3

The Naïve Approach seems to preserve the same shape and error size as on \(D_0\). The Wrong PCA Improved method changes slightly: the error still drops within the first three variables, but stabilizes more strongly afterwards. Again we see how the total noise added accounts for the different stabilization errors. For the Missing Data Approach error we observe a strong similarity to the previous analysis: its error decreases until variable \(3\) and then spikes up to stabilize. The only noticeable difference is that the total amount of noise added now plays a stronger role in where the curve converges, with the uniform low noise converging to the lowest error. For the Matrix Completion method we now observe that the algorithm does not spoil its prediction with high uniform noise, but does this time for differing noise: for high noise the method favors \(3\) variables, whereas it chooses \(2\) variables for differing noise. For uniform low noise it seems to work well, with a sharp decrease in error at rank \(3\) followed by an increase.

We now examine to what extent our algorithms change if we consider a structured covariance \(\Sigma_2\), with increasing impact on the last variables.



Figure 4.1: Comparison of CV methods on base 2 dataset with different noises as indicated by Noise°1,2 and 3

The Wrong PCA method again behaves the same regardless of the noise added, and the first observation also applies to the Wrong PCA Improved method. The only subtlety we want to emphasize is that the latter once again seems sensitive to the total amount of noise added: a stronger decrease over the first two variables is observable for the uniform and differing noises. Interestingly, the Missing Data Approach and the Matrix Completion method do not seem to handle this kind of structure very well. In fact, these algorithms are not able to return more precise estimations with a higher number of variables, and their error actually increases. Both have roughly the same shape and error size, the latter looking like a smoother version of the former. Also, the error curves stick closely together for the uniform noises in the Missing Data Approach, whereas the differing-noise curve starts beneath the other two error curves but exceeds them after rank \(5\). This adhesion is even stronger for the Matrix completion error, as its error curve is almost identical regardless of the noise added.


Figure 4.2: KDE Approach for base 0 dataset with different noises as indicated by Noise°1,2 and 3


Figure 4.3: KDE Modified Approach for the 3 datasets with different noises as indicated by Noise°1,2 and 3

For the KDE Modified Approach, we observe in general that the error drops very strongly up to the third variable. Having reached rank \(3\), the quality of its estimation, and correspondingly the error, stays roughly the same. In the case of \(D_2\) we observe the same behaviour, but instead of stabilizing after rank \(3\) it does so after reaching variable \(2\), by construction of our dataset.

4.1 Comparison with respect to initial covariance matrix


Figure 4.4: Comparison of CV methods on different dataset for Noise n°1 (high noise)



Figure 4.5: Comparison of CV methods on different dataset for Noise n°2 (low noise)



Figure 4.6: Comparison of CV methods on different dataset for Noise n°3 (differing noise)

5 Other PCs number selection methods

5.1 Percentage of variance explained method

The percentage of variance explained method is based on the eigenvalues of the dataset’s covariance matrix and on the total variability (the sum of the diagonal entries of the covariance matrix). When computing the PCs of a given dataset \(\mathbf{X} \in \mathbb{R}^{N \times p}\) with covariance \(\Sigma\), we create an orthonormal basis with a corresponding diagonal covariance matrix \(\Sigma_{PCA}\) whose entries are the eigenvalues of the corresponding PCs. As the components are constructed sequentially, the eigenvalues \(\lambda_1, \lambda_2, \ldots ,\lambda_p \in \mathbb{R}\) of the new matrix form a decreasing sequence. The percentage of variance explained method simply selects the first \(r\) components such that \(\frac{1}{V}\sum_{i=1}^r \lambda_i \ge \tau\) for an arbitrary threshold \(\tau \in [0,1]\) and total variability \(V = \sum_{i=1}^p \lambda_i = \mathrm{tr}(\Sigma_{PCA})\). Below are a few examples of the method on \(D_0\):


Figure 5.1: Eigenvalues of the PCA covariance matrix, with a threshold of 90% for \(D_0\) w.r.t. noise
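The selection rule reduces to a cumulative sum over the sorted eigenvalues; below is a short sketch (Python/NumPy for illustration, with a toy dataset whose first three directions dominate):

```python
import numpy as np

def n_components_for_variance(X, tau=0.90):
    """Smallest r whose leading eigenvalues explain at least a fraction tau
    of the total variance tr(Sigma_PCA)."""
    Sigma = np.cov(X, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]     # decreasing eigenvalues
    explained = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(explained, tau) + 1)

rng = np.random.default_rng(6)
# Three strong directions, five weak ones:
X = rng.normal(size=(200, 8)) * np.array([5, 4, 3, .2, .2, .2, .2, .2])
print(n_components_for_variance(X, tau=0.90))          # 3
```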

5.2 Scree-plot method

The scree-plot is again related to the eigenvalues of the newly created orthonormal basis of PCs which, as shown in the graph above, form a decreasing sequence. The scree-plot method consists of choosing the value of \(r\) which corresponds to the elbow of the eigenvalue graph. With \(100\) observations and \(8\) variables we get an overview of the extent to which every single value contributes to the variance of the dataset as a whole.


Figure 5.2: Scree plot on \(D_0\) with different noises as indicated by 1,2 and 3

We can clearly see that in each plot the first three singular values are much bigger than the later ones: they decrease only slightly until the \(4\)th, where we observe a strong drop, and from the \(4\)th on the values decrease in the same manner as the first \(3\). In these circumstances the [Scree-plot method (non-automatic)] would favor the \(4\)th value, as this value initiates the elbow characteristic in every single one of the graphs. This selection would unfortunately be the wrong one, as by construction (i.e. the truncation of our dataset) we were expecting the third singular value to initiate the “elbow” form.

As seen below, the “elbow plot” of our dataset with a random matrix as covariance has a somewhat similar shape. The first eigenvalue, however, explains much more of the overall variance of our dataset. It drops strongly until the second value, where the graph seems to stabilize. This behaviour holds until the third singular value, after which we observe a similar drop and decreasing behaviour as in the previous plot. As we already concluded for Figure 5.2, the [Scree-plot method (non-automatic)] would favor a wrong value.


Figure 5.3: Scree plot on \(D_1\) with random matrix as covariance with different noises as indicated by 1,2 and 3

The scree plot of \(D_2\), however, looks more promising, as we get a sharp drop after the first value for every kind of added noise. Therefore, choosing the optimal rank \(r\) according to the [Scree-plot method (non-automatic)] would coincide with the true rank of the dataset (which was \(2\) by construction, regardless of the number of observations and variables).


Figure 5.4: Scree plot on \(D_2\) with structured matrix as covariance with different noises as indicated by 1,2 and 3

5.3 Comparison with CV methods

Compared to the CV methods described in Cross-validation on PCA methods, these methods are relatively easy to implement, efficient (the only computationally expensive step is an eigenvalue decomposition of the covariance matrix) and quite interpretable. However, they do not generalize well if there are missing values in the dataset or, for the scree-plot, if the eigenvalues do not follow a clear elbow pattern.

6 Conclusion

In conclusion, several methods for choosing the optimal number of components have been presented. We could observe the behavior of those CV methods depending on the noise and on the structure of the covariance matrices thanks to simulations on various datasets. We concluded that some methods are more robust to noise (EXAMPLE) while others tend to give poor results when confronted with noise (EXAMPLE). Moreover, we had to assume that our dataset was a centered multivariate Gaussian for every method to be comparable; however, some methods generalize better to other data distributions (Matrix Completion for example) while others are more restrictive (EXAMPLE). Computation-wise, some methods are more efficient than others (EXAMPLE) and, assuming that all conditions are met to use them, the most efficient ones are therefore favoured when working with bigger datasets.